{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 数据预处理\n",
    "在本步骤中，我们会：\n",
    "1. 读取多个CSV文件并合并成一个DataFrame\n",
    "2. 处理缺失值和无效值\n",
    "3. 对标签进行编码\n",
    "4. 将数据标准化\n",
    "5. 将数据分为训练集和测试集\n"
   ],
   "id": "5bdb485b929e3859"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入必要的库\n",
    "import os\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import StandardScaler, LabelEncoder\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 导入库\n",
    "在本单元格中，我们导入了需要的库：\n",
    "- `os`：用于文件路径管理\n",
    "- `pandas`：用于数据操作和处理，例如加载和合并CSV文件\n",
    "- `StandardScaler` 和 `LabelEncoder`：用于数据标准化和标签编码\n",
    "- `train_test_split`：用于将数据分割成训练集和测试集\n",
    "\n",
    "**示例：**\n",
    "如果我们有一个CSV文件，可以通过 `pd.read_csv('file.csv')` 来读取数据，并用 `LabelEncoder` 对目标列进行编码。\n"
   ],
   "id": "a218d38edca7ae10"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 读取和合并CSV文件\n",
    "def load_and_merge_csv(folder_path):\n",
    "    # 获取文件夹中所有CSV文件的路径\n",
    "    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]\n",
    "    combined_data = pd.DataFrame()  # 初始化一个空的DataFrame用于存放合并的数据\n",
    "\n",
    "    # 遍历每个CSV文件并读取数据\n",
    "    for file in csv_files:\n",
    "        file_path = os.path.join(folder_path, file)\n",
    "        data = pd.read_csv(file_path)\n",
    "        data.columns = data.columns.str.strip()  # 去除列名中的空格\n",
    "        combined_data = pd.concat([combined_data, data], ignore_index=True)  # 合并到总的DataFrame中\n",
    "    return combined_data\n"
   ],
   "id": "bea779f10ce7081"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 读取和合并CSV文件\n",
    "该函数用于读取指定文件夹下所有CSV文件，并将它们合并为一个`DataFrame`。\n",
    "- `folder_path`：指定的文件夹路径。\n",
    "- `os.listdir(folder_path)`：获取文件夹中所有文件名。\n",
    "- `pd.read_csv(file_path)`：读取CSV文件。\n",
    "- `pd.concat()`：将多个`DataFrame`合并为一个。\n"
   ],
   "id": "f3e347fab5fc7acd"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 处理缺失值\n",
    "def handle_missing_values(data):\n",
    "    # 仅处理数值列的缺失值，填充为中位数\n",
    "    numeric_columns = data.select_dtypes(include=['number']).columns\n",
    "    data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].median())\n",
    "    return data\n"
   ],
   "id": "6872f59afd4f8a11"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 处理缺失值\n",
    "在这个函数中，我们处理数据中的缺失值，特别是数值列的缺失值。\n",
    "- `select_dtypes(include=['number'])`：选取数据中的数值列。\n",
    "- `fillna(median)`：用每列的中位数来填充缺失值，以减少异常值的影响。\n"
   ],
   "id": "eb7f1fe65da41f47"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 处理无效值（无穷大值和NaN）\n",
    "def handle_invalid_values(data):\n",
    "    data = data.replace([float('inf'), float('-inf')], float('nan'))  # 替换无穷大为NaN\n",
    "    data = data.fillna(0)  # 填充NaN值为0\n",
    "    return data\n"
   ],
   "id": "e1d02a48a4fdcbb2"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 处理无效值\n",
    "此函数将数据中的无效值（无穷大值）替换为 `NaN`，并进一步用 `0` 填充所有 `NaN` 值。\n",
    "- `replace([float('inf'), float('-inf')], float('nan'))`：替换正无穷和负无穷为 `NaN`。\n",
    "- `fillna(0)`：填充所有的 `NaN` 值为 `0`。"
   ],
   "id": "31ebd9da89278e48"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 标签编码（将分类标签转换为数字）\n",
    "def encode_labels(data, label_column='Label'):\n",
    "    if label_column in data.columns:\n",
    "        label_encoder = LabelEncoder()\n",
    "        data[label_column] = label_encoder.fit_transform(data[label_column])  # 编码标签列\n",
    "        return data, label_encoder\n",
    "    else:\n",
    "        return data, None  # 如果没有指定的标签列，返回原数据\n"
   ],
   "id": "5a5232f7a2973b05"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 标签编码\n",
    "该函数用于对数据中的目标列（标签列）进行编码，以便将非数值的类别标签转换为数值表示，这样模型可以直接使用。\n",
    "- `label_column`：指定标签列的名称，默认值为 `Label`。\n",
    "- `LabelEncoder()`：用于将类别标签转化为数值。\n",
    "- `fit_transform()`：直接将标签列的内容进行编码并替换。"
   ],
   "id": "6d2184caa6c71704"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 特征标准化\n",
    "def scale_features(data, feature_columns):\n",
    "    scaler = StandardScaler()\n",
    "    data[feature_columns] = scaler.fit_transform(data[feature_columns])  # 将特征缩放为标准正态分布\n",
    "    return data, scaler\n"
   ],
   "id": "6e1e5a73e5c55253"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 特征标准化\n",
    "此函数用于对数据集的特征列进行标准化处理，将数据转换为零均值和单位方差，以保证不同特征的数值尺度一致。\n",
    "- `StandardScaler()`：对特征进行标准化处理。\n",
    "- `fit_transform()`：对数据进行拟合和转换，将数值标准化。"
   ],
   "id": "f7a37ae290a3a1b8"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 数据分割，将数据划分为训练集和测试集\n",
    "def split_data(data, label_column='Label', test_size=0.2):\n",
    "    X = data.drop(columns=[label_column])  # 特征集\n",
    "    y = data[label_column]  # 标签列\n",
    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)\n",
    "    return X_train, X_test, y_train, y_test\n"
   ],
   "id": "d6ac987edd7d50ff"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 数据集划分\n",
    "该函数将数据集划分为训练集和测试集，分别用于模型的训练和验证。\n",
    "- `drop(columns=[label_column])`：将标签列（目标变量）从数据中移除，剩下的列作为特征。\n",
    "- `train_test_split()`：以 `80:20` 的比例（默认）划分数据集。"
   ],
   "id": "6f8e85d0150cf2e6"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 运行数据预处理流程\n",
    "folder_path = '../data/CICIDS2017/'\n",
    "output_dir = '../data/processed'\n",
    "combined_data = load_and_merge_csv(folder_path)  # 加载并合并CSV文件\n",
    "combined_data = handle_missing_values(combined_data)  # 处理缺失值\n",
    "combined_data = handle_invalid_values(combined_data)  # 处理无效值\n",
    "feature_columns = combined_data.select_dtypes(include=['number']).columns  # 选择数值列进行标准化\n",
    "combined_data, label_encoder = encode_labels(combined_data)  # 编码标签列\n",
    "\n",
    "if label_encoder is not None:\n",
    "    combined_data, scaler = scale_features(combined_data, feature_columns)  # 特征标准化\n",
    "    X_train, X_test, y_train, y_test = split_data(combined_data)  # 数据集划分\n",
    "\n",
    "    # 保存处理后的数据\n",
    "    os.makedirs(output_dir, exist_ok=True)\n",
    "    X_train.to_csv(f'{output_dir}/X_train.csv', index=False)\n",
    "    y_train.to_csv(f'{output_dir}/y_train.csv', index=False)\n",
    "    X_test.to_csv(f'{output_dir}/X_test.csv', index=False)\n",
    "    y_test.to_csv(f'{output_dir}/y_test.csv', index=False)\n",
    "    print(\"数据预处理完成并保存！\")\n",
    "else:\n",
    "    print(\"由于缺少 'Label' 列，跳过预处理。\")\n"
   ],
   "id": "52a7a111fb0250f9"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 数据预处理流程说明\n",
    "\n",
    "此单元格执行数据预处理的主要步骤，以下是具体操作说明：\n",
    "\n",
    "1. **指定文件路径**：定义数据文件夹 (`folder_path`) 和保存预处理结果的文件夹 (`output_dir`) 路径。\n",
    "2. **加载并合并CSV文件**：调用 `load_and_merge_csv` 函数，读取并合并文件夹中的CSV数据文件。\n",
    "   - **示例**：`combined_data` 将包含所有CSV文件合并后的数据。\n",
    "3. **处理缺失值**：调用 `handle_missing_values` 函数，填补数据中的缺失值。\n",
    "4. **处理无效值**：使用 `handle_invalid_values` 函数，将无效数据替换为适当的默认值（如 `0` 或均值）。\n",
    "5. **选择数值列进行标准化**：提取数据中的数值列名，后续步骤中将对这些列进行标准化处理。\n",
    "6. **编码标签列**：使用 `encode_labels` 函数对标签列（`Label` 列）进行编码。\n",
    "   - **重要**：检查标签列是否存在，如果不存在则跳过后续步骤。\n",
    "7. **特征标准化**：使用 `scale_features` 函数，对数值列进行标准化处理，使其均值为0，方差为1。\n",
    "8. **数据集划分**：使用 `split_data` 函数将数据集划分为训练集和测试集。\n",
    "9. **保存处理后的数据**：将预处理后的训练集和测试集特征、标签数据保存到指定目录中。\n",
    "   - 数据文件包括 `X_train.csv`, `y_train.csv`, `X_test.csv`, `y_test.csv`。\n",
    "\n",
    "**示例输出**：\n",
    "执行完此单元格后，处理后的数据文件会保存在 `output_dir` 目录中。输出包括处理成功信息“数据预处理完成并保存！”或缺少标签列时的提示信息。\n",
    "\n",
    "这个单元格完成了整个数据预处理的过程，为后续模型训练和评估提供了清理和准备好的数据集。\n"
   ],
   "id": "aafbb046c1c1dd28"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
